ABSTRACT

A new technique for extracting a hierarchical decomposition of a complex video selection for browsing purposes combines visual and temporal information to capture the important relations within a scene and between scenes in a video, thus allowing the analysis of the underlying story structure with no a priori knowledge of the content. A general model of hierarchical scene transition graphs is applied to an implementation for browsing. Video shots are first identified, and a collection of key frames is used to represent each video segment. These collections are then classified according to gross visual information. A platform is built on which the video is presented to the user as directed graphs, with each category of video shots represented by a node and each edge denoting a temporal relationship between categories. The analysis and processing of video is carried out directly on the compressed videos. Preliminary tests show that the narrative structure of a video selection can be effectively captured using this technique.

16 Claims, 13 Drawing Sheets 
METHOD AND APPARATUS FOR VIDEO BROWSING BASED ON CONTENT AND STRUCTURE

This is a continuation of application Ser. No. 08/382,877, filed Feb. 3, 1995, now U.S. Pat. No. 5,708,767.

FIELD OF THE INVENTION

The present invention relates generally to methods and apparatus for browsing of video material, and more specifically to automating the browsing process.

BACKGROUND OF THE INVENTION

The ability to browse through a large amount of video material to find the relevant clips is extremely important in many video applications. In interactive TV and pay-per-view systems, customers want to see sections of programs before renting them. While prepared trailers may suffice to publicize major motion pictures, episodic television, sports, and other programs will probably require browsers to let the customer find the program of interest. In the scholarly domain, digital libraries will collect and disseminate moving image material. Many scholars, including political scientists, psychologists and historians, study moving images as primary source material. They require browsers to help them find the material of interest and to analyze the material. Browsing is even more important for video than for text-based libraries because certain aspects of video are hard to synopsize.

Early browsers were developed for video production and as front ends for video databases. Today's standard technique for browsing is storyboard browsing, in which the video information is condensed into meaningful snapshots representing shots while significant portions of the audio can be heard. One known browser divides the sequence into equal length segments and denotes the first frame of each segment as its key frame. This browser does not detect scene transitions, and it may both miss important information and display repetitive frames. Another known browser stacks every frame of the sequence and provides the user with rich information regarding the camera and object motions. However, a scholar using a digital video library or a customer of a pay-per-view system is more interested in the contents (who, what, where) than how the camera was used during the recording.

A first step toward content-based browsing is to classify a long video sequence into story units, based on its content. Scene change detection (also called temporal segmentation of video) gives sufficient indication of when a new shot starts and ends. Scene change detection algorithms, an algorithm to detect scene transitions using the DCT coefficients of an encoded image, and algorithms to identify both abrupt and gradual scene transitions using the DC coefficients of an encoded video sequence are known in the art.

Beyond temporal segmentation of video, one known browser uses Rframes (representative frames) to organize the visual contents of the video clips. Rframes may be grouped according to various criteria to aid the user in identifying the desired material: the user can select a key frame, and the system then uses various criteria to search for similar key frames and present them to the user as a group. The user could search representative frames from the groups, rather than the complete set of key frames, to identify scenes of interest. It is known to use a language-based model to match the incoming video sequence with the expected grammatical elements of a news broadcast, and to use a priori models of the expected content of the video clip to parse the clip.

In reality, many moving image documents have story structures which are reflected in the visual content. Note that a complete moving image document is referred to herein as a video clip. The fundamental unit of the production of video is the shot, which captures continuous action. A scene is usually composed of a small number of interrelated shots that are unified by location or dramatic incident. Feature films are typically divided into three acts, each of which consists of about a half-dozen scenes. The act-scene-shot decomposition forms a hierarchy for understanding the story. News footage also has a similar structure: a news program is divided into stories, each of which typically starts with [...], and each containing several shots and perhaps multiple scenes. At the lower levels of the hierarchy, a scene may consist of alternating shots of the two main characters or an interviewer and interviewee. At the higher levels of abstraction, an act in a movie or a story in a news show may be signaled by a visual motif, such as [...].

In view of such underlying structures, a video browser should allow the user to first identify the scenes taking place at that location using visual information, select the scene desired using temporal information, and similarly navigate through the various shots in the scene using both visual and temporal information. Thus there is a real need to identify both visual and temporal relationships to allow the user to recognize the story structure and navigate to the desired point in the video.

SUMMARY OF THE INVENTION

In one embodiment of the invention, the story structure is modeled with a hierarchical scene transition graph, and the scenic structure is extracted using visual and temporal information with no a priori knowledge of the content; the structure is discovered automatically. A hierarchical scene transition graph reflects the decomposition of the video into acts, scenes and shots. Such a hierarchical view of the video provides an effective means for browsing the video content, since long sequences of related shots can be telescoped into a small number of key frames which represent the repeatedly appearing shots in the scene. In addition, the analysis and processing of video is carried out directly on videos compressed using common compression standards such as Motion JPEG and MPEG.

The present inventive system presents the content-structure of the video in a way that appeals to basic human understanding of the visual material in video. Also, a framework is provided on which the navigation in video databases can be facilitated both by machine automation and user interaction.

DESCRIPTION OF THE DRAWINGS

Various embodiments of the invention are described and illustrated herein with reference to the accompanying drawings, in which like items are identified by the same reference designation, wherein:

FIG. 1 is a block diagram and flowchart for one embodiment of the invention;

FIG. 2 shows video shots of the 1992 Democratic Convention in a sequence based upon temporal order;

FIG. 3a shows clustering results from video shots of the 1992 Democratic Convention for one embodiment of the invention;
FIG. 3b shows clustering results from a News Report, for an embodiment of the invention;

FIG. 4 shows an initial scene transition graph of the 1992 Democratic Convention video sequence, from an embodiment of the invention;

FIG. 5 shows an initial scene transition graph of the News Report sequence, for an embodiment of the invention;

FIG. 6 shows a top-level scene transition graph of the 1992 Democratic Convention video sequence, for one embodiment of the invention;

FIG. 7 shows a top-level scene transition graph of the News Report video sequence, for an embodiment of the invention;

FIG. 8 shows a block schematic diagram of a system for implementing the methods of various embodiments of the invention; and

FIGS. 9, 10, 11 and 12 show flowcharts for various embodiments of the invention, respectively.

DETAILED DESCRIPTION OF THE INVENTION

Scene Transition Graphs and Video Browsing

This section first defines the general model of hierarchical scene transition graphs, presents the notations and definitions of the terms used herein, and finally provides examples that the model represents.

Notation and definitions

A shot is defined as a continuous sequence of actions, as a fundamental unit in a video sequence. The ith shot is denoted by:

$s_i = \{f_{b_i}, \ldots, f_{e_i}\}$ (1)

where $f_j$ is the jth frame, $b_i$ is the beginning time of the shot and $e_i$ is the ending time of the shot. The relation $b_1 < e_1 < b_2 < e_2 < b_3 < \ldots$ is also assumed. In addition, we call a collection of one or more interrelated shots a scene. Many criteria can be imposed to determine if the shots are related or not.

A Hierarchical Scene Transition Graph (HSTG) on $\{s_i\}$ is a collection of directed graphs

$\{G_i\}, \quad i = 0, 1, \ldots, H-1$

with H denoting the total number of levels in the hierarchy, which has the following properties:

$G_0 = (V_0, E_0, F_0)$ (2)

where $V_0 = \{V_{0,i}\}$ is the node set and $E_0$ is the edge set. $F_0$ is a mapping that partitions $\{s_i\}$ into $V_{0,1}, V_{0,2}, \ldots$, the members of $V_0$. For given U, W in $V_0$, (U,W) is a member of $E_0$ if there exists some $s_l$ in U and $s_m$ in W such that $m = l+1$.

For $i = 1, 2, \ldots, H-1$:

$G_i = (V_i, E_i, F_i)$ (3)

Each $V_i$ and $E_i$ is the node set and edge set respectively. $F_i$ partitions $V_{i-1}$ into $V_{i,1}, V_{i,2}, \ldots$, the members of $V_i$. For given U, W in $V_i$, (U,W) is a member of $E_i$ if there exists some $v_l$ in U and $v_m$ in W such that $(v_l, v_m)$ is a member of $E_{i-1}$.

The hierarchy is uniquely defined by the action of $\{F_i\}$ on the collection of shots $\{s_i\}$.

Property (2) deals with grouping of shots at the lowest level of hierarchy. The collection of shots is partitioned into nodes of $G_0$; each node represents a cluster of shots, which are considered a scene in the general sense. A directed edge is drawn from node U to W if there is a shot represented by node U that immediately precedes some shot represented by node W. Further grouping into other levels of the hierarchy is defined in a similar fashion in property (3). The edge relationships induced by temporal precedence at level 0 are preserved as one moves up the hierarchy.

EXAMPLES

Example 1
Tree Representation of Shots

$F_0$ partitions $\{s_i\}$ into $V_{0,1} = \{s_1, \ldots, s_L\}$, $V_{0,2} = \{s_{L+1}, \ldots, s_{2L}\}$, ... (4)

$F_1$ partitions $\{V_{0,i}\}$ into $V_{1,1} = \{V_{0,1}, \ldots, V_{0,L}\}$, $V_{1,2} = \{V_{0,L+1}, \ldots, V_{0,2L}\}$, ..., and so on. (5)

This is a hierarchical organization in time of the collection of shots. At the lowest level, each node $V_{0,i}$ represents L shots; a directed edge connects $V_{0,i}$ to $V_{0,i+1}$, giving a directed path as the structure of the graph at this level. At the next level, each node $V_{1,i}$ represents L nodes of $G_0$. Such a tree hierarchy permits a user to have a coarse-to-fine view of the entire video sequence.

The temporal relation defined by the edge in this example is very simple. The relations between shots in a video tend to be more complicated. The next example defines $\{F_i\}$ to allow a better representation of content-based dependency between the shots.

Example 2
Directed graph representation of shots

There is one level of hierarchy, i.e. H=1. $F_0$ partitions $\{s_i\}$ into $V_{0,1}, V_{0,2}, \ldots$, such that the shots in each $V_{0,i}$ are sufficiently similar, according to some similarity measured in terms of low level vision indices such as colors, shapes, etc.

In this case, shots that are similar to each other are clustered together. Relations between clusters are governed by temporal ordering of shots within the two clusters. A simple example would be a scene of conversation between two persons; the camera alternates between shots of each person. The graph $G_0$ consists of two nodes $V_{0,1}$ and $V_{0,2}$; $(V_{0,1}, V_{0,2})$ and $(V_{0,2}, V_{0,1})$ are both members of $E_0$.

It should be noted that many implementations can fit in this model. One possible browsing system embodiment is shown under such a generalized framework in the following sections. Extensions of this model to include more levels of the hierarchy are possible by using semantic information, derived from further processing or from human interaction, as well as temporal information.

A Method For Video Browsing For One Embodiment

A block diagram approach for one embodiment of the invention is shown in FIG. 1. The browsing process is automated to extract a hierarchical decomposition of a complex video selection in four steps: the identification of video shots, the clustering of video shots of similar visual contents, the presentation of the content and structure to the
users via the scene transition graph, and finally the hierarchical organization of the graph structure.

All processing aspects of this system embodiment are performed on reduced data called a DC-sequence, derived directly from compressed video using Motion JPEG or MPEG. A DC-sequence is a collection of DC images, each of which is formed by blockwise averaging of an original image. In the case of a discrete cosine transform (DCT) based compression scheme, as in JPEG and I-frames of MPEG, the DC image is the collection of DC coefficients scaled by 8. The DC images in P and B frames of MPEG compressed sequences can be derived in terms of the DCT coefficients of the anchor frames, thus permitting the extraction of DC sequences directly from MPEG sequences. No full frame decompression is necessary. This approach has the advantage of computational efficiency because only a very small percentage of the entire video sequence is used. In addition, the use of DC sequences for processing permits a uniform framework for scene and content analysis on Motion JPEG and MPEG compressed video embodied in the system.

The identification of video shots is achieved by scene change detection schemes which give the start and end of each shot. A detection algorithm as described by B. L. Yeo and B. Liu, in "Rapid scene analysis on compressed videos", an unpublished paper submitted to IEEE Transactions on Circuits and Systems for Video Technology, is used, and the Yeo and Liu paper is incorporated herein by reference in its entirety to the extent it does not conflict herewith. The algorithm handles with high accuracy both abrupt changes and special effects like gradual transitions which are common in real-world video. The shots that exhibit visual, spatial and temporal similarities are then clustered into scenes, with each scene containing one or more shots of similar contents. Each such cluster is a basic node in the scene transition graph. From the clustering results and the temporal information associated with each shot, the system proceeds to build the graphs, with nodes representing scenes and edges representing the progress of the story from one scene to the next. The nodes capture the core contents of the video while the edges capture its structure. The browsing approach thus is based on both content and structure of a complex video selection. While the primitive attributes of the shots contribute the major clustering criteria at the initial stage of the scene transition graph construction, the hierarchy of the graph permits further organization of the video for browsing purposes. At each level a different criterion is imposed. The lower levels of the hierarchy can be based upon visual cues while the upper levels allow criteria that reflect the associated semantic information. This hierarchical organization offers a browsing structure that closely resembles human perception and understanding.

The present method offers a better organization of video contents than existing storyboard browsing schemes in facilitating the browsing process. In addition, no a priori models are assumed. It also allows user interaction to supplement machine automation in the course of analysis. This is crucial in video browsing, as perception and understanding of contents vary with individuals and no existing automation scheme has yet achieved precise accuracy.

Clustering of Video Shots

The temporal segmentation process of the video clip produces a collection of shots $\{s_i\}$, arranged in a linear array fashion according to their temporal orderings, as shown in FIG. 2. Each shot is visually represented by its representative DC image. This presentation format relieves the user from the need to watch the entire video during browsing. However, a typical one hour program can produce hundreds of different shots. In such cases, presenting shots in a one dimensional image array does not offer the users an effective and efficient means to browse, navigate and search for any particular video segments. This presents a greater challenge for the user if the user has never watched the video in its entirety, and has no idea where in the time array the search should start.

A hierarchical scene transition graph offers a better organization of video contents than existing story-board browsing schemes and facilitates the browsing process. Clustering of video shots is the first step toward the building of the graph. The sequence $\{s_i\}$ is grouped into clusters, denoted by $V_{0,1}, V_{0,2}, \ldots$, by measures of similarity. Verification of the clustering results is done through user interaction. Some degree of user verification is necessary to allow flexibility and because no a priori model is assumed for the content-structure of the video under investigation.

Similarity of shots

Low level vision analyses operated on video frames achieve reasonably good results for the measurement of similarity (or dissimilarity) of different shots. Similarity measures based on image attributes such as color, spatial correlation and shape can distinguish different shots to a significant degree, even when operated on much reduced images such as the DC images. Both color and simple shape information are used to measure similarity of the shots.

Color

Most of the video shots encountered every day are shots of real scenes, real people and realistic motions. It is rare that two shots of very different contents will have very similar visual colors. Color is an effective means to distinguish different shots. Swain and Ballard in their paper "Color Indexing", International Journal of Computer Vision, Vol. 7, pp. 11-32, 1991, pointed out that "although geometrical cues may be the most reliable for object identity in many cases, this may not be true for routine behavior: in such behavior wherein familiar objects are interacted with repeatedly, color may be a far more efficient indexing feature." The present inventors have confirmed this, and have found that in video with underlying story structures, like news broadcasts and night shows, where familiar scenes and people appear and interact repeatedly, color is far more effective than moment invariants when used to classify video shots.

The present system adapts the histogram intersection method from Swain and Ballard, as cited above, to measure the similarity of two images based on color histograms. A color histogram of an image is obtained by dividing a color space (e.g. RGB) into discrete image colors (called bins) and counting the number of times each discrete color appears by traversing every pixel in the image.

Given two images $f_i$ and $f_j$ and their histograms $I^i$ and $I^j$, each containing n bins, $I^i_k$ denotes the value of $I^i$ in the kth bin. The intersection of the histograms is defined to be:

$\sum_{k=1}^{n} \min(I^i_k, I^j_k)$ (6)

This gives an indication of the number of pixels that have similar colors in both images. It can be normalized to a match value between 0 and 1 by the following:
$S(I^i, I^j) = \dfrac{\sum_{k=1}^{n} \min(I^i_k, I^j_k)}{\sum_{k=1}^{n} I^j_k}$ (7)

The present system uses the mismatch value, $D(I^i, I^j) = 1 - S(I^i, I^j)$, to indicate dissimilarity between the two images instead of the match value defined above.

In the present implementation, the inventors found that an 8x8x4 discretization of the RGB color space is adequate to distinguish DC images of size 40x30 pixels.

Shape

The present system uses as another measure of similarity between two images the two-dimensional moment invariants of the luminance. However, the inventors discovered that the orders of magnitude of different moment invariants vary greatly: in many examples the ratio of the first moment invariant to the third or fourth moment invariant can vary by several orders of magnitude.

By using the Euclidean distance of the respective moment invariants as a measure of dissimilarity between two images, such distance will most likely be dominated by the moment invariant with the highest order of magnitude. The purpose of using more than one moment invariant for shape matching will then be totally defeated. Attempts in the present system to scale the moment invariants to comparable orders of magnitude, or to weigh different invariants by different weights, have not produced agreeable matching results. In fact, the inventors determined that clustering based on matching by Euclidean distance on the moment invariants gives inferior results in comparison to those produced using color only.

Correlation of images

The inventors discovered that measuring correlation between two small images (even the DC images) does give a very good indication of similarity (it is actually dissimilarity in the definition below) in these images. By using the sum of absolute difference, the correlation between two images, $f_m$ and $f_n$, is commonly computed by:

$\varepsilon(m,n) = \sum_{j=1}^{J} \sum_{k=1}^{K} |f_m(j,k) - f_n(j,k)|$ (8)

The correlation is known to be very sensitive to translation of objects in the images. However, when applied to the much reduced images, effects due to object translation appear to lessen to a great degree.

The inventors found that correlation measures can achieve clustering results as good as those done by color. To further reduce the storage space needed and increase the computational efficiency, the inventors devised a simple way to measure correlation using the luminance projection defined as follows. For a given image $f_m(j,k)$, $j = 1, 2, \ldots, J$ and $k = 1, 2, \ldots, K$, the luminance projection for the lth row is:

$P^r_m(l) = \sum_{j=1}^{J} \mathrm{Lum}\{f_m(j,l)\}$ (9)

and the luminance projection for the lth column is:

$P^c_m(l) = \sum_{k=1}^{K} \mathrm{Lum}\{f_m(l,k)\}$ (10)

This is an array of size K+J and does not require the JxK storage for the whole image for correlation calculation in the later stages. To test the similarity of images $f_m$ and $f_n$, the sum of absolute difference of the row and column projections is used as follows:

$\sum_{l=1}^{K} |P^r_m(l) - P^r_n(l)| + \sum_{l=1}^{J} |P^c_m(l) - P^c_n(l)|$ (11)

The performance using the luminance projections is comparable to that of using full correlation. Thus equation (11) is used to represent the measure of correlation in two images.

Temporal variations in a video shot

Since a video shot comprises many images, similarity between the shots needs to be defined beyond that of planar images. If only a representative image is chosen for a video shot, it is very likely that there are two shots which are indeed very much the same in content but are judged to be different because the representative frames chosen for these two shots are different. For example, consider the following two shots: the first shot starts with a zoomed-in view of Mr. A, and the camera then zooms out to include Ms. B; the second shot starts with the camera focusing on Ms. B, then it zooms out to include Mr. A. One may think these shots are similar (this happens often in the anchor room during news broadcasts), yet if the first image from each shot is taken as the representative image, the two representative images can be drastically different depending on the similarity measures used.

In a video shot where object and camera motions are prominent, a representative image is not sufficient for the faithful analysis of the image set it represents. Yet it is computationally very burdensome to include every frame in the shot for processing purposes, as much of the temporal information is redundant. It is important to balance the two goals: to preserve as much of the temporal variations as possible and to reduce the computing load needed to process many video frames in a given shot. In the present system, the inventors chose a good but nevertheless greatly reduced representative set of frames to represent a video shot. However, it should be noted that while a representative image is used to represent a shot in the presentation of the visual results, the analysis and clustering is not confined to only one such representative image.

Clustering Algorithms

A proximity matrix for the video shots is built based on the dissimilarity measures defined for color, shape and correlation for their representative image sets. Each entry of the matrix is a proximity index. Here the inventors followed the definitions in the book by Jain and Dubes, "Algorithms for Clustering Data", Prentice Hall, 1988, to define the proximity index on dissimilarity: a proximity index between the ith and kth patterns ($s_i$ and $s_k$) is denoted by d(i,k) and must satisfy the following three properties:

$d(i,i) = 0$, for all i (12)

$d(i,k) = d(k,i)$, for all (i,k) (13)

$d(i,k) \geq 0$, for all (i,k) (14)

For two shots $s_i$ and $s_k$, the proximity index can be D(i,k) for color, $\varepsilon(i,k)$ for correlation and U(i,k) for shape, or any combination of the three. Note that the three measures of dissimilarity are not transitive.

The proximity matrix serves as the only input to a clustering algorithm. A clustering is a type of classification imposed on a finite set of objects, whose relationship to each other is represented by the proximity matrix in which rows and columns correspond to objects.
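The color mismatch, the luminance projections and the proximity matrix described above can be sketched in a few lines. This is a minimal illustration and not the inventors' implementation: the function names, the representation of an image as a list of rows of (r, g, b) tuples, and the Rec. 601 luminance weights are all assumptions, and the images being compared are assumed to have the same size so that the normalized intersection is symmetric.

```python
def color_histogram(image, bins=(8, 8, 4)):
    """Count pixels per RGB bin (8x8x4 by default); channels are 0..255."""
    hist = [0] * (bins[0] * bins[1] * bins[2])
    for row in image:
        for r, g, b in row:
            ri = r * bins[0] // 256
            gi = g * bins[1] // 256
            bi = b * bins[2] // 256
            hist[(ri * bins[1] + gi) * bins[2] + bi] += 1
    return hist


def color_mismatch(hist_a, hist_b):
    """D = 1 - S, where S is the histogram intersection divided by the
    pixel count of the second histogram (assumed equal for both images)."""
    intersection = sum(min(a, b) for a, b in zip(hist_a, hist_b))
    return 1.0 - intersection / sum(hist_b)


def luminance(pixel):
    """Luminance of an (r, g, b) pixel; the Rec. 601 weights are an assumption."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def projections(image):
    """Row and column luminance projections: one sum per row, one per column."""
    rows = [sum(luminance(p) for p in row) for row in image]
    cols = [sum(luminance(row[x]) for row in image)
            for x in range(len(image[0]))]
    return rows, cols


def projection_distance(img_a, img_b):
    """Sum of absolute differences of the row and column projections."""
    rows_a, cols_a = projections(img_a)
    rows_b, cols_b = projections(img_b)
    return (sum(abs(x - y) for x, y in zip(rows_a, rows_b))
            + sum(abs(x - y) for x, y in zip(cols_a, cols_b)))


def proximity_matrix(items, dissimilarity):
    """Symmetric matrix with a zero diagonal; each pair is evaluated once."""
    n = len(items)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(i + 1, n):
            d[i][k] = d[k][i] = dissimilarity(items[i], items[k])
    return d


# Three tiny 2x2 "DC images": two reddish frames and one blue frame.
red = [[(250, 0, 0)] * 2 for _ in range(2)]
red2 = [[(240, 10, 5)] * 2 for _ in range(2)]
blue = [[(0, 0, 250)] * 2 for _ in range(2)]

histograms = [color_histogram(img) for img in (red, red2, blue)]
d = proximity_matrix(histograms, color_mismatch)
```

With these toy frames the two reddish images fall into the same bin, so their mismatch is 0, while either of them against the blue frame gives a mismatch of 1. Because each pair is evaluated once and mirrored, the matrix satisfies properties (12) to (14) by construction whenever the chosen measure is non-negative.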
To allow for user interaction, the inventors chose hierarchical clustering over partitional clustering (the latter gives a single partition, and the former is a sequence of nested partitional classifications). The present system algorithm first groups the pair of shots that are most similar together, and then proceeds to group other shots by their proximity values. The proximity values between clusters are updated in the process of grouping. However, any clustering algorithm that gives partitions similar to human classifications based on visual characteristics of the video shots will do for this purpose. In addition, a good (not necessarily perfect) clustering scheme suffices as the initial step in building the scene transition graph. The present system allows users to re-classify the mismatches. Furthermore, underclassification is always preferred to overclassification to better accommodate the hierarchical scene transition graph. This means that it is preferable to leave a shot as a single cluster rather than to group it into other clusters with which it is not in close match.

The user is given the flexibility to interactively select the number of clusters desired or to set caps on the dissimilarity values between individual shots allowed in a cluster. In test trials of the present system, after the initial shot partitions, the user only needs to slightly adjust the knobs to change these partitions to yield satisfactory results, often with fewer than four such trials. FIGS. 3a and 3b show clustering results on two sequences: a 16-minute Democratic Convention video sequence, and a News Report, respectively. The directed scene transition graph is laid out using the algorithms disclosed by Laszlo Szirmay-Kalos, in a paper "Dynamic layout algorithm to display general graphs", in Graphics Gems IV, pp. 505-517, Academic Press, Boston, 1994. FIGS. 4 and 5 show the sample interface and graph layout of the two above-mentioned video sequences, based on the results in FIGS. 3a and 3b, respectively. Each node represents a collection of shots, clustered by the method described above. For simplicity, only one frame is used to represent the collection of shots. A means is also provided for the users to re-arrange the nodes, group the nodes together to form further clusters, and ungroup some shots from a cluster. This enables the user to organize the graphs differently to get a better understanding of the overall structures.

Image attributes have served as the measurement of similarity between video shots at the low levels of the scene transition graph hierarchy. In tests of the present method and system, the matching of shots based on primitive visual characteristics such as color and shape resembles the process in which the users tend to classify these video shots, when they have no prior knowledge of the video sequence given to them. However, in addition to color and shape, the users are capable of recognizing the same people (in different backgrounds, clothes, and under varying lighting conditions) in different shots, and classify these shots to be in the same cluster. Further classification and grouping is possible after the users have acquired more understanding of the video sequences.

This suggests that automatic clustering schemes for the scene transition graph building can be made at multiple levels. At each level, a different criterion is imposed. The inventors considered that vision techniques are the keys in the lower levels: image attributes contribute the clustering [...]

The scene transition graph makes use of the temporal relations to mark the edges of the directed graph. Nevertheless, this structure can also be applied to the spatial relations within an image or spatio-temporal relations within a shot. In that case, one can further enhance the hierarchy of the graph to represent an even better understanding of video and add to the machine classification capability.

Before all the issues are resolved and properly defined, one needs to investigate closely how the users interact with the browsing system. The present interface provides such an option for the users to group hierarchically the nodes of the graph based on their subjective criteria. At the same time, by closely monitoring what criteria they impose at different levels, and how these subjective criteria can be decomposed into more concrete criteria for machine automation, the hierarchy for the graph can be improved. In FIGS. 6 and 7, potentially useful top level scene transition graphs of the respective video sequences of FIGS. 3a and 3b, respectively, are shown.

In FIG. 4, an initial scene transition graph of the 1992 Democratic Convention video sequence is shown in the central portion of the figure. A node selected by a user is illustrated in the center of the initial scene transition graph via a bold arrow and darkening of the edges of the highlighted node. The top window shows the video shots represented by the highlighted node. The bottom window in the figure shows the video shots of the scene transition graph arranged in temporal order. Similarly in FIG. 5, the upper and primary portion shows an initial scene transition graph of a News Report video sequence. A lower center window shows the video shots of the scene transition graph arranged in temporal order. The bottom window shows the video shots associated with the highlighted node.

With further reference to FIG. 6, a top-level scene transition graph of the 1992 Democratic Convention video sequence is shown. The lower window shows the contents or video shots of the highlighted node. The nodes of the scene transition graph are typically representative of complete scene changes relative to the nodes of the initial scene transition graph of FIG. 4. Similar comments apply to FIG. 7 relative to FIG. 5, for the News Report video sequence.

A block schematic diagram of a hardware system programmed for processing various methods of the invention is shown in FIG. 8. Flowcharts for the programming for selected ones of the steps of FIG. 1 are shown in the flowcharts of FIGS. 9, 10, 11 and 12.

Conclusions

The present method provides a general framework of hierarchical scene transition graphs for video browsing. Also presented is a new browsing method through the identification of video shots, the clustering of video shots by similarity, and the presentation of the content and structure to the users via the scene transition graph. This multi-step analysis of the video clip reveals the relationship between visual and temporal information in the clip. The combination captures the important relationships within a scene and between scenes in a video, thus allowing the analysis of the underlying story structure without any a priori knowledge.

The feasibility of this framework has been demonstrated by an implementation that automates the process of shot identification, clustering of similar shots, and construction of

criterion at the bottom level, image segmentation (e.g. scene transition graph. All processing are performed on 

foregrounds and backgrounds) and object recognition can be DC-sequences that are extracted directly from compressed 

the next level of classification. In the lop levels of the MPEG and Motion JPEG sequences. The greatly reduced 

hierarchy, subgraph properties and temporal structures, such 65 data size permits fast computation. For example, the chis- 

as discovering repeated self-loops and subgraph tering on the 16-minute Democratic Convention 1992 

isomorphism, can be explored to further condense the graph. sequence of FIG. 3a is performed in seconds on an SGI Indy. 
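The construction of a scene transition graph from clustered shots can be illustrated with a minimal sketch. It assumes each shot has already been assigned a cluster label and that the labels are listed in temporal order. The function name, the label values, and the choice to add no edge between consecutive shots of the same cluster are illustrative assumptions, not details taken from the implementation described above.

```python
from collections import defaultdict


def build_scene_transition_graph(shot_labels):
    """Build a directed graph from cluster labels given in temporal order.

    Returns (nodes, edges): nodes maps each cluster label to the indices
    of its shots in temporal order; edges is a set of (u, v) pairs meaning
    that some shot in cluster u is immediately followed by a shot in v.
    """
    nodes = defaultdict(list)
    edges = set()
    for index, label in enumerate(shot_labels):
        nodes[label].append(index)
        # Assumption: consecutive shots in the same cluster add no edge.
        if index > 0 and shot_labels[index - 1] != label:
            edges.add((shot_labels[index - 1], label))
    return dict(nodes), edges


# Hypothetical label sequence for six shots.
labels = ["A", "B", "A", "B", "C", "A"]
nodes, edges = build_scene_transition_graph(labels)
```

In this sketch each node keeps its shots in temporal order, and each edge records only that one category of shots is followed in time by another.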
The scene transition graph can be further refined hierarchically.
Each level has the flexibility of being constructed
via different criteria to reflect both visual and semantic
structure that closely resembles human perception and understanding.
Thus the scene transition graph provides users with
a tool for better understanding and use of complex videos.
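One way to picture a step of hierarchical refinement is to group several nodes of a lower-level graph into a single node of the next level. The sketch below is a hedged illustration only: the function name `group_nodes`, the dictionary-and-set representation, and the rule that edges falling inside the new group are dropped are all assumptions made for this example.

```python
def group_nodes(nodes, edges, members, new_label):
    """Merge the nodes named in `members` into one node called `new_label`.

    nodes maps a label to a list of shot indices; edges is a set of
    (u, v) label pairs. Returns a new (nodes, edges) pair and leaves
    the inputs unmodified.
    """
    members = set(members)
    # The merged node keeps all member shots, in temporal order.
    merged_shots = sorted(i for m in members for i in nodes[m])
    new_nodes = {k: list(v) for k, v in nodes.items() if k not in members}
    new_nodes[new_label] = merged_shots

    def rename(label):
        return new_label if label in members else label

    # Assumption: edges between two members of the group are dropped.
    new_edges = {
        (rename(u), rename(v))
        for u, v in edges
        if rename(u) != rename(v)
    }
    return new_nodes, new_edges


# Hypothetical lower-level graph with three nodes.
low_nodes = {"A": [0, 2], "B": [1, 3], "C": [4]}
low_edges = {("A", "B"), ("B", "A"), ("B", "C")}
top_nodes, top_edges = group_nodes(low_nodes, low_edges, ["A", "B"], "AB")
```

Here the alternating pair A and B collapses into a single node AB, so the higher-level graph has one edge, from AB to C.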

Although various embodiments of the invention are
shown and described herein, they are not meant to be
limiting. Those of skill in the art may recognize certain
modifications to these embodiments, which modifications
are meant to be covered by the spirit and scope of the
appended claims.

What is claimed is: 

1. A video browser comprising: 

node defining means for defining a plurality of nodes,
each of the plurality of nodes representing a collection
of video shots;

temporal relationship defining means for defining at least
one temporal relationship between at least first and
second ones of the plurality of nodes; and

a display for displaying graphical representations of the 
plurality of nodes and the at least one temporal rela- 
tionship. 

2. The video browser of claim 1, wherein the video shots
of each of the plurality of nodes are interrelated by at least
one of color, spatial correlation and shape.

3. The video browser of claim 1, wherein the video shots
of each of the plurality of nodes are arranged in temporal
order.

4. The video browser of claim 1, wherein the graphical
representation of the temporal relationship is an edge.

5. The video browser of claim 1, wherein the graphical
representations of the plurality of nodes and the at least one
temporal relationship are displayed as a hierarchical scene
transition graph.

6. The video browser of claim 5, wherein the hierarchical 
scene transition graph includes a plurality of levels and 
wherein each of the levels can be constructed according to 
different criteria. 



7. The video browser of claim 1, further comprising 
means for grouping video shots of at least two of the 
plurality of nodes to form an additional node. 

8. The video browser of claim 1, further comprising 
means for ungrouping video shots from at least one of the 
plurality of nodes to form an additional node. 

9. A method for presenting video shots for browsing, 
comprising the steps of: 

defining a plurality of nodes, each of the plurality of nodes 
representing a collection of the video shots; 

defining at least one temporal relationship between at 
least first and second ones of the plurality of nodes; and 

displaying graphical representations of the plurality of 
nodes and the at least one temporal relationship. 

10. The method of claim 9, wherein the video shots of 
each of the plurality of nodes are interrelated by at least one 
of color, spatial correlation, and shape. 

11. The method of claim 9, wherein the video shots of 
each of the plurality of nodes are arranged in temporal order. 

12. The method of claim 9, wherein the graphical
representation of the temporal relationship is an edge.

13. The method of claim 9, wherein the graphical repre- 
sentations of the plurality of nodes and the at least one 
temporal relationship are displayed as a hierarchical scene 
transition graph. 

14. The method of claim 13, wherein the hierarchical 
scene transition graph includes a plurality of levels and 
wherein each of the levels can be constructed according to 
different criteria. 

15. The method of claim 9, further comprising the step of
grouping video shots of at least two of the plurality of nodes
to form an additional node.

16. The method of claim 9, further comprising the step of 
ungrouping video shots from at least one of the plurality of 
nodes to form an additional node. 





